So far we’ve talked about inference through the lens of Parameter Estimation.
Parameter Estimation: inference methods that use data to determine the value of population parameters (e.g. a regression coefficient, group mean, mean difference, proportion…)
Hypothesis Testing: inference methods that use data to support a particular theory/hypothesis
Hypothesis Testing
While “Bayes vs. Frequentist?” is an important question to ask when choosing statistical tools, “Parameter Estimation vs. Hypothesis Testing” is even more important.
Parameter Estimation
Hypothesis Testing
Bayesian
Frequentist
Hypothesis Testing
My (Chelsea’s) Personal Claim: People often misuse Hypothesis testing in situations where their questions are better answered by Parameter Estimation.
Hypotheses
My mean crossword time is faster than yours (\(\mu_{me} \lt \mu_{you}\))
There is no effect of corgi height on corgi weight (\(\beta_1 = 0\))
Drug A’s reduction in cold symptoms is equivalent to Drug B’s (\(\mu_{A} = \mu_{B}\))
There is no difference in the mean anxiety of Joy Group A and Joy Group B (\(\mu_{A} = \mu_{B}\))
Parameter Estimation
The estimate of my mean crossword time is \(25.89 \pm 2\)
The regression coefficient of corgi height on corgi weight is between \([-0.1, 0.26]\)
The mean difference between the reduction in cold symptoms for groups A and B is \(1.22 (0.01,2.45)\)
The mean difference between the anxiety ratings for groups A and B is \(-0.2\) with a standard error of \(0.05\)
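Interval estimates like these all follow the same recipe: point estimate plus or minus a critical value times the standard error. A minimal sketch using the anxiety example's numbers; the 95% level and the normal critical value are my assumptions, since the slide only gives the estimate and SE:

```python
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Interval estimate: point estimate +/- critical value * SE."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # ~1.96 for 95%
    return estimate - z * se, estimate + z * se

# anxiety example from the slide: estimate -0.2, SE 0.05
lo, hi = wald_ci(-0.2, 0.05)  # roughly (-0.298, -0.102)
```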
Null Hypothesis Significance Testing
Reductio Ad Absurdum
Note: NHST is NOT the only way to be a Frequentist. Often, critiques of Frequentism are actually critiques of NHST.
Reductio Ad Absurdum (Reduction to Absurdity): a form of proof by contradiction. To prove X, assume not-X and show that this leads to a false, ridiculous, or highly unlikely outcome. Therefore X.
Example:
Claim: there is no smallest positive rational number (one that can be represented as a fraction \(\frac{g}{n}\))
Assume Contradiction: there is a smallest positive rational number \(q\)
RAA: since \(q\) is positive and rational, \(\frac{q}{2}\) is also a positive rational number (literally repping it with a fraction rn) and \(\frac{q}{2} \lt q\), so \(q\) was not the smallest after all
Conclusion: there is no smallest positive rational number
Reductio Ad Absurdum
Claim: there is no town with a local barber who shaves all and only those who do not shave themselves.
Prove this with RAA.
Null Hypothesis Significance Testing
Null Hypothesis: Any hypothesis of “no effect”
Significance Testing: rejecting or failing to reject a hypothesis
Null Hypotheses
The regression coefficient of IQ’s effect on Income is 0 (\(\beta_{iq} = 0\))
The mean difference between the GPA of EECS and CADS students is 0 (\(d_{e-c} = 0\))
The proportion of heads on this coin is 0.5 (\(p = 0.5\))
All of the above assume “no effect”
Test Statistics
Test-Statistic: a summary of the data calculated from a sample; a function of the data, \(f(x)\)
First of all, I am very proud to have taught you this much about Statistical Inference without once mentioning p-values. But alas. It is time.
P-Values
P-values: \(p(\text{data} \mid H_0)\); assuming the null is true and there’s no effect, what is the probability of observing a test-statistic as or more extreme than the one we calculated from our data?
Null Sampling Distribution
Notice that this sampling distribution is centered not on \(\hat{\theta}\) but on our null value \(\theta_0\). The standard error is still calculated as \(\frac{\sigma}{\sqrt{n}}\). It defines the sample estimates \(\hat{\theta}\) we’d expect if we repeatedly sampled from the null.
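One way to see the null sampling distribution is to simulate it: draw repeated samples from a population whose parameter equals the null value \(\theta_0\) and collect the sample estimates. A sketch with made-up values (null mean 0, \(\sigma = 1\), \(n = 25\); none of these come from the slides):

```python
import random
from statistics import mean, stdev

random.seed(0)
theta_0, sigma, n = 0.0, 1.0, 25   # assumed null value and population sd

# sample estimates (here, sample means) under repeated sampling from the null
null_estimates = [
    mean(random.gauss(theta_0, sigma) for _ in range(n))
    for _ in range(10_000)
]
# centered on theta_0, with spread close to sigma / sqrt(n) = 0.2
```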
P-Values
P-values: \(p(\text{data} \mid H_0)\); assuming the null is true and there’s no effect, what is the probability of observing a test-statistic as or more extreme than the one we calculated from our data?
Directional P-Values
Directional Null: \(\mu \geq 0\)
Non-Directional Null: \(\mu = 0\)
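The directional choice changes which tail(s) of the null distribution count as “as or more extreme.” A stdlib-only sketch for a z statistic, assuming a standard-normal null distribution:

```python
from statistics import NormalDist

def p_value(z, directional=False):
    """p-value for a z statistic under a standard-normal null:
    one tail (in the observed direction) for a directional null,
    both tails for a non-directional null."""
    tail = 1 - NormalDist().cdf(abs(z))
    return tail if directional else 2 * tail

# for z = 1.5: directional p ~ 0.067, non-directional p ~ 0.134
```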
Fisherian Hypothesis Testing
💡 p-values are a continuous measure of evidence against \(H_0\)
❓ answers the question “is the observed data consistent with \(H_0\)?”
Fisherian Hypothesis Testing
Choose an appropriate test
Define \(H_0\)
Calculate p-value
Assess Significance
the lower the p-value, the stronger the evidence against the null
Fisherian Hypothesis Testing
We’re testing the hypothesis that SuperSmartizine™️ increases IQ. We gave SuperSmartizine™️ to 25 people and measured their IQs. The mean IQ is 104.5.
Assess Significance: this is not strong evidence against the null. We’d expect sample means as or more extreme than this about 10% of the time under repeated samples from the null
Fisherian Hypothesis Testing
We’re testing the hypothesis that the effect of minutes of exercise on heart attack \(\beta_{exercise}\) is different from 0. We collect 1000 data points and fit a logistic regression model \(\text{heart\_attack} \sim \text{age} + \text{sex} + \text{exercise\_minutes}\). The coefficient is \(-0.0025059\)
Choose an appropriate test: z-test (summary of a logistic regression reports Wald z values for each coefficient)
Define \(H_0\): \(\beta_{exercise} = 0\)
Calculate p-value: computer does this for us tbh; \(0.000747\)
Assess Significance: this is strong evidence against the null. We’d expect coefficient estimates as or more extreme than this about 0.0747% of the time under repeated samples from the null.
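That p-value can be reproduced by hand from the glm output: divide the estimate by its standard error to get the z statistic, then take both normal tails. A stdlib-only sketch using the numbers printed in the output:

```python
from statistics import NormalDist

est, se = -0.0025059, 0.0007432         # from the glm output
z = est / se                            # ~ -3.372, the printed z value
p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p ~ 0.000747
```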
Fisherian Hypothesis Testing
lr <- glm(heart_attack ~ age + sex + exercise_minutes,
          data = data, family = binomial)
summary(lr)
Call:
glm(formula = heart_attack ~ age + sex + exercise_minutes, family = binomial,
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3199939 0.2798233 1.144 0.252807
age 0.0043040 0.0045122 0.954 0.340157
sex 0.2871836 0.1294311 2.219 0.026499 *
exercise_minutes -0.0025059 0.0007432 -3.372 0.000747 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1361.2 on 999 degrees of freedom
Residual deviance: 1344.9 on 996 degrees of freedom
AIC: 1352.9
Number of Fisher Scoring iterations: 4
Studies in Crop Variation
Statistical Methods for Research Workers
“The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.”
Justify Your \(\alpha\)
https://osf.io/preprints/psyarxiv/ts4r6
Neyman-Pearson Significance Testing
💡 make a decision about whether you will act as if \(H_0\) is false while controlling your long-run error rates
❓ answers the question “is the observed data extreme enough for us to reject \(H_0\)?”
Neyman-Pearson Significance Testing
Null Hypothesis: Any hypothesis of “no effect” (\(H_0\))
Alternative Hypothesis: the opposite of the Null, i.e. there is an effect (\(H_1\) or \(H_A\))
|              | Fail to Reject \(H_0\) | Reject \(H_0\)    |
|--------------|------------------------|-------------------|
| \(H_0\) True | Correct                | Type I Error; FP  |
| \(H_1\) True | Type II Error; FN      | Correct           |
Neyman-Pearson Significance Testing
\(H_0: \mu > 0\)
Neyman-Pearson Significance Testing
Fail to Reject \(H_0\): we have not provided evidence that \(H_0\) is false; we will not act as if it’s false
Reject \(H_0\): we have provided evidence that \(H_0\) is false; we will act as if it’s false
Neyman-Pearson Significance Testing
These four outcomes all have defined probabilities.
|              | Fail to Reject \(H_0\) | Reject \(H_0\) |
|--------------|------------------------|----------------|
| \(H_0\) True | \(1-\alpha\)           | \(\alpha\)     |
| \(H_1\) True | \(\beta\)              | \(1-\beta\)    |
Remember: we get to choose \(\alpha\) directly
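Because we pick \(\alpha\) directly, the critical value follows from it. A minimal sketch for a two-sided z test:

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-sided critical value for a z test: reject when |z| exceeds it."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# z_critical(0.05) ~ 1.96; a stricter alpha = 0.01 gives ~ 2.576
```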
Neyman-Pearson Significance Testing
Choose an appropriate test
Define \(H_0\) and \(H_A\)
Calculate test-statistic and critical value
Assess Significance
if our test statistic is more extreme than our critical value, we will act as if \(H_0\) is false
Neyman-Pearson Significance Testing
We are testing the hypothesis that the proportion of Chapman students who voted is different from the US proportion of \(0.66\). We polled 100 Chapman students and 75% (0.75) of them voted.
Choose an appropriate test: one sample z-test for proportions
Define \(H_0\) and \(H_A\):
\(H_0\): \(p_{chap} = 0.66\)
\(H_A\): \(p_{chap} \neq 0.66\)
Calculate test-statistic and critical value: \(z = \frac{0.75-0.66}{se} = 1.9\); critical value is \(1.96\) when \(\alpha = 0.05\)
Assess Significance: we fail to reject \(H_0\) and will not act as if \(H_0\) is false.
Z-statistic: 1.899901
P-value: 0.05744605
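The printed z-statistic and p-value can be reproduced with a one-sample z-test for proportions; the standard error here is computed under the null proportion, which is what makes \(z = 1.899901\) come out as shown:

```python
from math import sqrt
from statistics import NormalDist

def prop_z_test(p_hat, p0, n):
    """One-sample z test for a proportion, SE computed under the null."""
    se = sqrt(p0 * (1 - p0) / n)            # SE using the null p0
    z = (p_hat - p0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p

z, p = prop_z_test(0.75, 0.66, 100)  # z ~ 1.8999, p ~ 0.0574
```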
Neyman-Pearson Significance Testing
We are testing the hypothesis that the coefficient for the effect of age on stress levels \(\beta_{age}\) is not 0.
Choose an appropriate test: t-test
Define \(H_0\) and \(H_A\)
\(H_0: \beta_{age} = 0\)
\(H_A: \beta_{age} \neq 0\)
Calculate test-statistic and critical value: t-statistic = \(2.465\), critical value with \(\alpha = 0.05\) and \(df = 98\) is \(1.984\)
Assess Significance: We reject \(H_0\); we will act as if \(H_0\) is false and assume \(\beta_{age} \neq 0\)
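Where does 1.984 come from? It is the two-sided t critical value at \(\alpha = 0.05\) with 98 residual degrees of freedom (in R, qt(0.975, 98)). A stdlib-only sketch using a standard first-order expansion of the t quantile around the normal one; this is an approximation, so use a stats library (e.g. scipy.stats.t.ppf) for exact values:

```python
from statistics import NormalDist

def t_critical_approx(alpha, df):
    """Approximate two-sided t critical value via the first-order
    expansion t ~ z + (z**3 + z) / (4 * df); good for moderate df."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z + (z**3 + z) / (4 * df)

cv = t_critical_approx(0.05, 98)  # ~ 1.984
# the observed t statistic 2.465 exceeds cv, so we reject H0
```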
Neyman-Pearson Significance Testing
# Run linear regression model
model <- lm(stress_level ~ age, data = data)
# Summarize the model
summary(model)
Call:
lm(formula = stress_level ~ age, data = data)
Residuals:
Min 1Q Median 3Q Max
-22.357 -6.099 -0.210 5.956 22.156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.3127 3.1666 9.573 1.02e-15 ***
age 0.1795 0.0728 2.465 0.0154 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.692 on 98 degrees of freedom
Multiple R-squared: 0.0584, Adjusted R-squared: 0.04879
F-statistic: 6.078 on 1 and 98 DF, p-value: 0.01543
Power Analysis
If there is an effect, how likely are you to detect it (\(1-\beta\), statistical power)?
|              | Fail to Reject \(H_0\)  | Reject \(H_0\)          |
|--------------|-------------------------|-------------------------|
| \(H_0\) True | \(1-\alpha\) Correct    | \(\alpha\) Type I Error |
| \(H_1\) True | \(\beta\) Type II Error | \(1-\beta\) Power       |
Power Analysis
❓ What are things we could change that would increase our statistical power?
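Each lever the question is pointing at shows up directly in an approximate power formula for a two-sided one-sample z test: power grows with effect size, sample size, and \(\alpha\), and shrinks with the population sd. A sketch with made-up numbers (none from the slides):

```python
from math import sqrt
from statistics import NormalDist

def power_ztest(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z test
    (the far tail's tiny contribution is ignored)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(effect) * sqrt(n) / sigma   # how far the truth sits from H0
    return 1 - NormalDist().cdf(z_crit - shift)

# bigger n, bigger effect, or bigger alpha all raise power:
# power_ztest(0.5, 1, 25) ~ 0.70 vs power_ztest(0.5, 1, 50) ~ 0.94
```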